A Review Paper about Deep Learning for Medical Image Analysis

Medical imaging refers to the process of obtaining images of internal organs for clinical purposes such as detecting or studying diseases. The primary objective of medical image analysis is to improve the efficacy of clinical research and treatment options. Deep learning has transformed medical image analysis, yielding excellent results in image processing tasks such as registration, segmentation, feature extraction, and classification. This progress is driven mainly by the availability of computational resources and the resurgence of deep convolutional neural networks. Deep learning techniques are good at detecting hidden patterns in images and supporting clinicians in reaching accurate diagnoses. They have proven to be among the most effective methods for organ segmentation, cancer detection, disease categorization, and computer-assisted diagnosis. Many deep learning approaches have been published to analyze medical images for various diagnostic purposes. In this paper, we review work that exploits current state-of-the-art deep learning approaches in medical image processing. We begin the survey by providing a synopsis of research works in medical imaging based on convolutional neural networks. Second, we discuss popular pretrained models and generative adversarial networks that aid in improving convolutional networks' performance. Finally, to ease direct evaluation, we compile the performance metrics of deep learning models focusing on COVID-19 detection and child bone age prediction.


Introduction
Computer-aided diagnosis (CAD) has emerged as one of the most important research fields in medical imaging. In CAD, machine learning algorithms are often used to examine imaging data from historical patient samples and construct a model to assess a patient's condition [1]. The developed model assists clinicians in making quick decisions. The most common imaging modalities used in medical applications are X-ray, computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), and ultrasound. The aim of medical image processing is to improve the interpretability of the depicted information [2]. The main categories of medical image analysis are enhancement, registration, segmentation, classification, localization, and detection [3].
Earlier, medical images were processed using low-level methods, such as thresholding, region growing, and edge tracing [4]. Meanwhile, the growth in size and scope of medical imaging data has fueled the evolution of machine learning techniques in medical image analysis. However, since such methods rely on handcrafted features, algorithm design requires manual effort. These constraints of conventional machine learning approaches have given rise to the notion of artificial neural networks (ANNs). Factors such as data availability and computational processing capabilities have facilitated the deepening of ANNs [5]. The emergence of deep learning techniques like convolutional neural networks has widened the possibilities for the automation of medical image processing.
A convolutional neural network (CNN) is a class of neural networks designed to process pixel data. CNNs make image classification more scalable by applying linear operations such as convolution to detect patterns inside an image. While traditional CNN architectures consisted solely of convolutional layers stacked on top of one another, modern architectures such as Inception, ResNet, and DenseNet introduce new ways of building convolutional layers that make learning more efficient [6].
CNN can be employed as a feature extractor as well. Feature extraction aims to convert raw pixel data into numerical features that can be processed while keeping the information in the original data set. Traditional feature extractors can be replaced with CNNs, which can extract complex features that express the image in much more detail. The resulting features are then fed into a classifier network or used by typical machine learning algorithms for classification [7].
Although deep CNN architectures exhibit cutting-edge performance on computer vision problems, there are some concerns about using CNNs in the radiology field. In 2014, Goodfellow et al. showed that adding a small amount of noise to the original input can readily deceive neural networks into misclassifying it [8]. Furthermore, since the efficiency of deep learning often depends on the volume of input data, CNNs require large-scale, well-annotated radiology images. Building such databases in the medical industry, however, is costly and labor-intensive.
In this study, we summarize the current developments in deep learning approaches for medical image analysis. The paper is organized as follows: first, survey papers related to medical image analysis are discussed in Section 2. Then, in Section 3, CNN models employed in the radiology field and approaches for improving CNN performance are described. Following that, the findings of models aimed at detecting COVID-19 and predicting child bone age are reviewed in Section 4. Finally, the conclusion is set out.

Related Works
This section discusses the survey papers on medical image analysis using deep learning-based algorithms. Hu et al. [9] described four deep learning architectures used for image analysis: CNN, fully convolutional networks (FCN), deep belief networks, and autoencoders. They also compiled recent works on cancer identification and diagnosis.
Liu et al. [10] concentrated on deep learning-based medical image segmentation. They began by explaining the deep learning framework deployed to segment medical images. Then, state-of-the-art segmentation architectures such as FCN, U-Net, and generative adversarial network (GAN) were examined.
Shin et al. [11] first studied two medical diagnostic problems, namely, interstitial lung disease detection and thoracoabdominal lymph node classification, using three CNN models: AlexNet, CifarNet, and GoogLeNet. They then examined how transfer learning enhanced the performance of each model.
Kazeminia et al. [12] provided a broad insight into the current studies on GANs for medical applications, discussed the limitations and opportunities of the existing techniques, and elaborated on potential future work.
Fu et al. [13] highlighted the current advancements in medical image segmentation by deep learning. They divided the approaches reviewed into two main groups: pixel-by-pixel classification and end-to-end segmentation, and discussed the performance, limits, and future potential of each group.

Medical Image Analysis Using Deep Learning
The primary focus of medical image analysis is to find out which regions of the anatomy are affected by the disease to aid physicians in learning about lesion progression. The analysis of a medical image relies mostly on four steps: (1) image preprocessing, (2) segmentation, (3) feature extraction, and (4) pattern identification or classification [14]. Preprocessing is used to remove unwanted distortions from images or improve image information for further processing. Segmentation refers to the process of isolating regions, such as tumors and organs, for further study. Feature extraction is the process of extracting precise details from the regions of interest (ROIs) that aid in their recognition. Based on the extracted features, classification assists in categorizing the ROI.
We have compiled a list of research papers primarily concerned with segmentation and classification in medical imaging. Following the review of CNN, we have outlined some techniques for improving CNN's performance.

Convolutional Neural Network.
A CNN is a supervised deep learning framework that can accept images as input, apply filters to convert image pixels into features, and use those features to distinguish one class of data from another. It is generally composed of three types of layers: the convolutional layer, pooling layer, and fully connected layer. The convolutional layer is the initial layer of a convolutional network. After that, more convolutional layers or pooling layers can be added, with the fully connected layer being the last.
The convolutional block draws the features from the image, from which the network can analyze and obtain hidden correlations. Pooling layers are applied to reduce the size of the convolved features, which is referred to as downsampling. Fully connected layers execute classification tasks depending on the features retrieved by the preceding layers. While convolutional layers generally adopt the rectified linear unit (ReLU) function to activate neurons, fully connected layers employ a softmax activation function or traditional machine learning classifiers (SVM, KNN, etc.) to classify inputs.
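As an illustration of the layer sequence described above, the following is a minimal pure-Python sketch of one convolution, ReLU, max-pooling, and softmax pass. The image, the edge-detecting kernel, and the fully connected weights are hypothetical values chosen for readability; they are not taken from any of the reviewed models, and a practical CNN would learn them from data with a deep learning library.

```python
import math

def conv2d(image, kernel):
    """Valid (no padding, stride 1) 2D cross-correlation of two nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(fmap):
    """Rectified linear unit applied element-wise."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 5 x 5 image containing a bright vertical stripe.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
# Hypothetical kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 1], [-1, 1]]

features = max_pool2d(relu(conv2d(image, kernel)))  # [[2, 0], [2, 0]]

# Hypothetical fully connected layer with two output classes.
flat = [v for row in features for v in row]
weights = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
scores = [sum(w * x for w, x in zip(row, flat)) for row in weights]
probs = softmax(scores)
print(features, probs)
```

The pooled map is a quarter of the size of the convolved map, which is the downsampling effect mentioned above, and the softmax output sums to one so that it can be read as class probabilities.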

Overview of Works.
Although deep networks can extract features with more precision, they require a lot of computing resources. Therefore, Badža and Barjaktarović [15] introduced a simple CNN model with two convolutional blocks for classifying brain tumors using MRI images. When evaluated on 3064 MRI images, the model attained a best accuracy of 95.56% using 10-fold cross-validation. Rachapudi and Lavanya presented an efficient CNN architecture with a 22.7% error rate to classify colorectal cancer histopathological images. To prevent overfitting, the model included five convolutional blocks, each containing a dropout layer [16].
Computational and Mathematical Methods in Medicine
The deep learning architecture for image segmentation comprises an encoder and a decoder. The encoder uses filters to extract features from the image, whereas the decoder is in charge of producing the final output, often a segmentation mask containing the object's shape. A fully convolutional network (FCN) is an encoder-decoder model that lacks dense layers in favor of 1 × 1 convolutions to serve the function of fully connected layers [17]. Sun et al. developed a 3D FCNN-based model for multimodal brain tumor image segmentation. The encoder had four pathways for extracting multiscale image features [18]. Then, these four feature maps were fused and fed to the decoder. In experimental validation on the brain tumor segmentation challenge dataset 2019 (BraTS2019), the model achieved Dice similarity coefficients (DSC) of 0.89, 0.78, and 0.76 for the complete, core, and enhanced tumor, respectively.
In 2015, Ronneberger et al. introduced U-Net, a biomedical image segmentation network that can learn from a small number of annotated medical images [19]. U-Net is a U-shaped encoder-decoder framework consisting of four encoder and four decoder blocks connected by skip connections. Dharwadkar and Savvashe employed the U-Net architecture to design a ventricle segmentation model for heart MRI images. The original U-Net has four levels, but only three were employed in this model [20]. On the right ventricle segmentation challenge (RVSC) dataset, the proposed model obtained a Dice score of 0.91.
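The Dice similarity coefficient used to report the segmentation results in this section is commonly defined as DSC = 2|A ∩ B| / (|A| + |B|), where A is the predicted mask and B is the ground-truth mask. The sketch below computes it for flattened binary masks. The example masks are hypothetical, and returning 1.0 when both masks are empty is an assumed convention rather than something stated in the reviewed papers.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total

# Hypothetical 6-pixel masks: 2 overlapping pixels, 3 foreground pixels each.
predicted_mask = [1, 1, 1, 0, 0, 0]
ground_truth = [0, 1, 1, 1, 0, 0]
print(dice_coefficient(predicted_mask, ground_truth))  # 2 * 2 / 6
```

A score of 1 indicates perfect overlap and 0 indicates no overlap, so the values quoted as percentages elsewhere in this review correspond to the same quantity multiplied by 100.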
For segmenting the left ventricle from cardiac CT angiography, Li et al. introduced a U-Net with 8 layers. The proposed U-Net model comprised eight encoder and eight decoder blocks. To further improve the network's efficiency, residual blocks in the form of skip connections were introduced into each encoder and decoder block [21]. The model was trained using 1600 CT images from 100 patients, resulting in a DSC of 0.9270 ± 139. Li et al. [22] introduced an attention mechanism between nested encoder-decoder paths in the U-Net++ [23] architecture to improve the network's focus on the region of interest in liver segmentation. The model achieved a DSC of 98.15% in experiments on the liver tumor segmentation challenge dataset 2017 (LiTS2017).
V-Net extends U-Net by processing 3D MRI images with 3D convolutions [24]. Guan et al. developed a V-Net-based framework for separating brain tumors from 3D MRI brain images. In the developed framework, the squeeze and excite (SE) module and attention guide filter (AG) module were integrated into V-Net architecture to suppress irrelevant information and enhance segmentation accuracy [25]. When tested on the BraTS2020 dataset, the model obtained dice metrics of 0.68, 0.85, and 0.70 for the complete, core, and enhanced tumor, respectively.
Mask region-based CNN (Mask R-CNN) is another CNN variant used in medical image segmentation. Mask R-CNN is a two-phase object identification and segmentation architecture. The first stage, known as the region proposal network (RPN), returns potential bounding boxes, whereas the second stage generates the segmentation mask from each box [26]. Dogan et al. introduced a hybrid model combining U-Net and Mask R-CNN for pancreas segmentation from CT images. The proposed system was composed of two parts: pancreas detection and pancreas segmentation. In pancreas detection, the region proposal network, in conjunction with the mask production network, was used to determine the bounding boxes of the pancreas portion, and the subregion centered on the rough pancreas region was cropped [27]. Finally, the cropped subregion was sent to U-Net for precise segmentation. The average DSC for the two-phase approach, evaluated on 82 abdominal CT scans, was 86.15%.
Improving the Performance of CNN.
The CNN model is often used for image classification because it achieves high accuracy with a low error rate. However, it needs large datasets to generalize the hidden correlations found in the training data. Here, we discuss two approaches that may improve the performance of CNN: (1) transfer learning and (2) generative adversarial network (GAN).

Transfer Learning.
Transfer learning is an effective strategy to train a network with a limited dataset. Here, the model is pretrained using a large-scale dataset, like ImageNet, which has 1.4 million images divided into 1000 categories, and then applied to the problem at hand [28]. The major pretrained CNN architectures for image classification are as follows:
(1) LeNet-5: LeNet-5 [29], a 7-level convolutional network presented by LeCun et al. in 1998, was the first of its kind. The model was designed to classify handwritten digits and tested on the MNIST standard dataset, with a classification accuracy of roughly 99.2%
(2) AlexNet: the network's design was quite similar to LeNet, but it was deeper, with more filters per layer. It contains five convolution layers and three fully connected layers. To control overfitting, it employs a dropout mechanism in the fully connected layers [30]
(3) Visual Geometry Group at Oxford (VGGNet): VGGNet typically consists of 16 layers with many 3 × 3 filters of stride one [31]. It is one of the most popular networks for extracting features from images. VGGNet, on the other hand, has 138 million parameters, which makes it difficult to manage
(4) InceptionV1/GoogLeNet: the Inception/GoogLeNet architecture, presented by Christian Szegedy et al., has 22 layers. The Inception block performs 1 × 1, 3 × 3, and 5 × 5 convolutions and 3 × 3 pooling on the input, and the outputs of these are stacked and sent to the next Inception module [32]. By using 1 × 1 convolutions in each module, GoogLeNet reduces the number of parameters to 4 million compared to AlexNet's 60 million
(5) Residual network or ResNet: a residual network, often known as ResNet, is a 152-layer model. This network employs a VGG-19-inspired design, with grouped convolutional layers, no pooling in between, and an average pooling before the fully connected output layer [33]. The design is converted into a residual network by adding shortcut connections. This sort of skip connection   Table 1. 
LeNet is a popular CNN model because of its simple architecture and shorter training time. Deep neural network models use the max-pooling layer to extract the most relevant features from a region. However, in medical image analysis, where image quality is often poor, pixels with lower intensities may hold critical information. Hence, Hazarika et al. introduced the minimum pooling layer in LeNet for Alzheimer's disease (AD) classification. In the modified LeNet [34], the min-pooling and max-pooling layers were merged, and the resulting layer replaced all max-pooling layers. According to the experimental study on 2000 brain images, the original LeNet model classified AD with 80% accuracy, while the revised LeNet attained an accuracy of 96.64%.
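The motivation for adding min-pooling can be sketched as follows. How the two pooled outputs are merged in the modified LeNet is not specified in this review, so the function below simply returns both maps side by side, which is an assumption made for illustration only; the feature map values are also hypothetical.

```python
def pool2d(fmap, size, reducer):
    """Non-overlapping pooling with an arbitrary reducer such as max or min."""
    return [[reducer(fmap[i + a][j + b]
                     for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

def min_max_pool2d(fmap, size=2):
    """Return (max-pooled map, min-pooled map).

    Assumption: the maps are kept as two separate channels. The merging
    rule used by the reviewed model may differ.
    """
    return pool2d(fmap, size, max), pool2d(fmap, size, min)

# Hypothetical feature map with two low-intensity pixels (1 and 0).
feature_map = [
    [9, 8, 7, 9],
    [8, 1, 9, 8],
    [7, 9, 8, 9],
    [9, 8, 9, 0],
]

max_map, min_map = min_max_pool2d(feature_map)
print(max_map)  # [[9, 9], [9, 9]]
print(min_map)  # [[1, 7], [7, 0]]
```

In this example, max-pooling alone produces a uniform map and discards the two dark pixels, whereas the min-pooled map retains them.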
Hosny et al. introduced a fine-tuned AlexNet model to categorize skin lesions into seven classes using skin images. In the proposed architecture, the last three layers were replaced by new layers to make the network suitable for classifying seven types of skin lesions [35]. The parameters of these new layers were initially set at random and then modified during training. After training on 10,015 images, the model achieved an accuracy of 98.70% and a sensitivity of 95.60%. Dulf et al. trained and assessed five different models, namely GoogLeNet, AlexNet, VGG16, VGG19, and InceptionV3, to determine the best model for classifying the eight categories of colorectal polyps. The main criteria for adopting the network were sensitivity and F1-score [36]. Hence, InceptionV3 was chosen with an F1-score of 98.14% and a sensitivity of 98.13%. In InceptionV3 [37], the 5 × 5 convolutional layer is replaced with two 3 × 3 convolutional layers to lower the computational cost.
Hameed et al. demonstrated an ensemble deep learning strategy to categorize breast cancer into carcinoma and noncarcinoma using histopathology images. In this case, VGG models, namely VGG16 and VGG19, were used to design the framework. VGG19 has the same basic architecture as VGG16 with three additional convolutional layers. Except for the first block, the remaining four blocks were updated during training to fine-tune the models [38]. Finally, the tuned VGG16 and VGG19 models were ensembled, resulting in an overall accuracy of 95.29%.
Togacar et al. used both VGG16 and AlexNet to extract features for brain tumor classification from MRI images, where each model captured 1000 features [39]. Then, using the recursive feature elimination (RFE) feature selection algorithm, the obtained features were evaluated to identify the most efficient features. Finally, the SVM classifier gave 96.77% accuracy with 200 chosen features. Eid and Elawady presented a ResNet-based SVM for pneumonia detection using X-rays. The developed model used ResNet to extract features from chest X-rays, then used a boosting algorithm to choose the relevant features and an SVM classifier to detect pneumonia based on those features [40]. The model had 98.13% accuracy after being trained on 5,863 X-rays.
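The saving obtained by replacing one 5 × 5 convolution with two stacked 3 × 3 convolutions can be checked with simple parameter arithmetic. Two 3 × 3 convolutions with stride one cover the same 5 × 5 receptive field. The sketch below assumes, purely for illustration, 64 input and 64 output channels with one bias per output channel; these channel counts are hypothetical and are not the actual InceptionV3 configuration.

```python
def conv_params(kernel_size, in_channels, out_channels, bias=True):
    """Number of learnable parameters in a square 2D convolutional layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)

channels = 64  # hypothetical channel count, chosen only for illustration

single_5x5 = conv_params(5, channels, channels)
stacked_3x3 = 2 * conv_params(3, channels, channels)

print(single_5x5)   # 102464
print(stacked_3x3)  # 73856
print(round(stacked_3x3 / single_5x5, 3))  # 0.721
```

Under these assumptions, the stacked design needs about 28% fewer parameters than the single 5 × 5 layer while seeing the same input neighborhood.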
Xiao et al. used a Res2Net-based 3D-UNet to segment the left ventricle from echocardiography images. To extract 3D features at multiple scales, the basic residual unit in Res2Net was replaced with a set of 3 × 3 × 3 filters [41]. Finally, a group of 1 × 1 × 1 filters merged feature maps from all groups. According to an experimental analysis of 1186 lung images from the Lung Nodule Analysis dataset 2016
In the proposed work, InceptionResNetV2 was adopted as the CNN network to segment the kidneys. Then, to refine the segmentation result, postprocessing procedures such as eliminating any voxel that was not associated with the kidney and fill operation were performed [42]. The proposed model got a mean dice score of 0.904 after being evaluated with 100 scans.
Generative Adversarial Network.
Goodfellow et al. introduced the generative adversarial network (GAN), a type of neural network meant for unsupervised learning. GANs generally consist of two competing neural network models: a generator that creates new data samples that mimic the training data and a discriminator that differentiates training data from the generator's output [43].
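The competition between the two networks can be expressed through their loss functions. The sketch below assumes the commonly used binary cross-entropy formulation with the non-saturating generator loss, which is only one of several GAN objectives. The discriminator outputs are hypothetical probabilities rather than outputs of a trained model.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for the discriminator.

    d_real: discriminator probabilities for real training samples.
    d_fake: discriminator probabilities for generated samples.
    The loss is low when real samples score near 1 and fakes near 0.
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: low when fakes are scored as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs early and later in training.
d_real = [0.9, 0.8]
d_fake_early = [0.1, 0.2]  # generated samples are easy to spot
d_fake_later = [0.5, 0.6]  # generated samples have become more convincing

print(discriminator_loss(d_real, d_fake_early), generator_loss(d_fake_early))
print(discriminator_loss(d_real, d_fake_later), generator_loss(d_fake_later))
```

As the generated samples become harder to distinguish, the generator loss falls while the discriminator loss rises, which reflects the adversarial relationship described above.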

Overview of Works.
GAN-based methods used in medical image analysis are listed in Table 2. Cirillo et al. introduced a 3D GAN to segment brain tumors using MRI images from the BraTS2020 dataset. The U-Net-based generator produced the segmented tumor region. The GAN discriminator was given a 3D MRI image and its segmentation output from the generator as input and generated a precise segmentation mask [44]. The GAN model segmented the whole, the core, and the enhanced tumor with average Dice scores of 87.20%, 81.14%, and 78.67%, respectively. Wang et al. developed a U-Net segmentation network and a discriminant network with multiscale feature extraction to enhance prostate segmentation accuracy [45]. The approach obtained a DSC value of 91.66% when evaluated on 220 MRI images. Wei et al. used a combination of GAN and Mask R-CNN to segment the liver from CT images. In the improved Mask R-CNN, the k-means algorithm was utilized to adjust the bounding box parameters using a Euclidean distance [46]. The GAN-based approach yielded an average DSC of 95.3% when evaluated on 378 CT images. A V-Net and Wasserstein GAN-based model was explored by Ma et al. [47] to improve the efficiency of liver segmentation. The WGAN [48] model uses the Wasserstein distance to fix the issue of GAN training instability. On two abdominal CT scan datasets, LiTS and CHAOS, the method achieved DSC of 92% and 90%, respectively. Zhang et al. proposed a dense GAN coupled with U-Net to separate lung lesions from COVID-19 CT images. A dense block with five layers [49] was introduced into the discriminator network to make the model more compact. The proposed model got a mean Dice score of 0.683 when tested on 100 lung CT images.
GAN can also be used for data augmentation [65] (i.e., creating plausible examples to add to a dataset) to boost classifier accuracy. GAN was also used [66] to generate realistic skin cancer images. The generator generated high-quality training data, and the discriminator tried to distinguish the original data from the generator's data. Ahmad et al. developed an auxiliary GAN framework to assess the accuracy of skin cancer categorization. First, the variational autoencoder network was trained to obtain the latent noise vector, and the generator produced skin lesion samples from this informative noise vector [67]. The GAN used here not only decided whether the image was original or not but also predicted the image's class label with 92.5% accuracy.

Discussion
To make straightforward comparisons, we have summarized the outcomes of the papers based on COVID-19 identification and child bone age prediction.
-19) by leveraging the advancement of machine learning. Furthermore, it regularly releases deep learning models and benchmark datasets to keep up with the pandemic [79]. In response to this initiative, Wang et al. introduced COVID-Net, a deep CNN for COVID-19 identification from chest X-rays.
In the COVID-Net model, residual projection-expansion-projection-extension (PEPX) blocks, which comprise four 1 × 1 convolutions, were introduced to enhance the efficiency [80]. The model efficacy was verified using 13,975 X-ray images from the COVIDx dataset. According to the experimental findings, the model attained a precision of 98.9% and a recall of 91.0% in detecting COVID-19. COVIDx is a large-scale dataset of chest X-ray images compiled from publicly available data sources. As of now, COVIDx consists of 30,882 X-ray images from 17,026 patients. Table 3 shows the deep learning models that were employed for COVID-19 detection. We can observe from the methods studied that the classification accuracy is affected not only by the CNN model chosen but also by the size of the dataset, the type of modality, the data augmentation techniques, and the features selected for processing.
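Precision and recall, as reported for COVID-19 detection above, are derived from the counts of true positives, false positives, and false negatives. The sketch below shows the calculation. The counts are hypothetical and are not taken from the COVID-Net evaluation, and returning 0.0 for an undefined ratio is an assumed convention.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from confusion-matrix counts.

    Precision: fraction of predicted positives that are truly positive.
    Recall: fraction of actual positives that the model detected.
    """
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    # Assumed convention: report 0.0 when a ratio is undefined.
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Hypothetical counts for a COVID-19 versus non-COVID-19 classifier.
precision, recall = precision_recall(true_pos=90, false_pos=10, false_neg=30)
print(precision, recall)  # 0.9 0.75
```

A model can therefore have high precision and lower recall at the same time: few of its positive predictions are wrong, but it misses some of the actual cases.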

RSNA Pediatric Bone Age Challenge 2017.
In 2017, the Radiological Society of North America (RSNA) held a contest to predict children's bone age from hand X-rays. The main goal of this challenge was to encourage people to develop machine learning models that could accurately estimate bone age from pediatric hand X-rays. The performance measure was the mean absolute error in months, that is, the average absolute difference between the predicted and ground-truth bone age [81]. The bone age dataset [82], consisting of 14,236 left-hand X-ray images, was divided into a training set, a validation set, and a test set of 12,611, 1,425, and 200 images, respectively. Table 4 summarizes some of the recent CNN-based methods that used the RSNA bone age benchmark dataset. The approaches listed were divided into two phases. In the first phase, a CNN model was used to segment the hand region from the X-ray images. The second phase included a pretrained model for extracting inherent features from the hand region and a regression layer to estimate bone age.
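The challenge metric, mean absolute error in months, can be computed as in the following sketch. The predicted and ground-truth ages are hypothetical values and are not drawn from the RSNA dataset.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predictions and ground truth."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Hypothetical bone ages in months for four hand X-rays.
predicted_months = [120, 96, 150, 60]
true_months = [126, 90, 150, 66]

print(mean_absolute_error(predicted_months, true_months))  # 4.5
```

Because the errors are taken as absolute values, overestimates and underestimates do not cancel out, so a lower value always indicates closer agreement with the reference bone age.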

Conclusion
We have presented a detailed overview of newly published deep learning-based methods from 2019 to 2022 in medical imaging. Recent advances in deep learning architectures have the ability to boost diagnostic precision in medical imaging. On the other hand, deep learning necessitates a large volume of data to outperform traditional machine learning models. In practice, however, obtaining such datasets containing medical images is difficult. Transfer learning via pretrained models can help to solve this problem. There is a clear tendency toward modifying pretrained models to make them more appropriate for a specific task. This popularity is because pretrained models expedite training while ensuring good classification accuracy. Another trend is to employ GAN to enhance segmentation accuracy due to its capacity to generate high-quality medical images and imitate input data distribution. The GAN-based approaches have proven to be effective in resolving discrepancies between ground truth and model-generated segmentation masks. Also, GAN's ability to synthesize data can help solve difficulties such as lack of medical images or imbalanced data distribution, resulting in improved classification model performance.

Data Availability
The data that support the findings of this study are openly available within the reference list.

Conflicts of Interest
The authors declare that they have no conflicts of interest.